Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Authors
Abstract
Consider the problem of sampling sequentially from a finite number of N ≥ 2 populations, specified by random variables X^i_k, i = 1, . . . , N, and k = 1, 2, . . . , where X^i_k denotes the outcome from population i the k-th time it is sampled. It is assumed that for each fixed i, {X^i_k}_{k≥1} is a sequence of i.i.d. normal random variables with unknown mean μ_i and unknown variance σ_i^2. The objective is to have a policy π for deciding from which of the N populations to sample at each time t = 1, 2, . . . so as to maximize the expected sum of outcomes of n total samples, or equivalently to minimize the regret due to the lack of information about the parameters μ_i and σ_i^2. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite horizon regret bounds are given.
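The setting above can be sketched in code with a variance-aware index policy: each round, pull the arm whose sample mean plus a confidence bonus is largest, where the bonus is scaled by the arm's sample standard deviation since the variances are unknown. The index used below is a generic stand-in of this form, not necessarily the paper's exact ISM index, and the function name `ism_style_bandit`, the arm parameters, and the horizon are all illustrative assumptions.

```python
import math
import random

def ism_style_bandit(means, sds, horizon, seed=0):
    """Simulate an inflated-sample-mean-style index policy on Gaussian arms.

    Each arm's index is its sample mean plus a bonus proportional to the
    sample standard deviation (a generic variance-aware upper confidence
    bound; the paper's exact ISM inflation term may differ).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms     # T_i(t): number of times arm i was sampled
    sums = [0.0] * n_arms     # running sum of outcomes per arm
    sumsq = [0.0] * n_arms    # running sum of squared outcomes per arm
    total = 0

    def draw(i):
        nonlocal total
        x = rng.gauss(means[i], sds[i])
        counts[i] += 1
        sums[i] += x
        sumsq[i] += x * x
        total += 1

    # Initialisation: sample each arm a few times so sample variances exist.
    for i in range(n_arms):
        for _ in range(3):
            draw(i)

    while total < horizon:
        best, best_idx = 0, -float("inf")
        for i in range(n_arms):
            k = counts[i]
            mean = sums[i] / k
            # Unbiased sample variance, floored to avoid a zero bonus.
            var = max((sumsq[i] - k * mean * mean) / (k - 1), 1e-12)
            idx = mean + math.sqrt(var) * math.sqrt(2.0 * math.log(total + 1) / k)
            if idx > best_idx:
                best_idx, best = idx, i
        draw(best)

    return counts

# Two arms with means 0 and 1: the policy should concentrate on arm 1.
counts = ism_style_bandit(means=[0.0, 1.0], sds=[1.0, 1.0], horizon=2000)
```

Because the index inflates the sample mean by a term that shrinks as an arm is sampled more often, suboptimal arms are pulled only logarithmically often, which is the behaviour the regret bounds in the paper quantify.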
Similar resources
On Bayesian Upper Confidence Bounds for Bandit Problems
Stochastic bandit problems have been analyzed from two different perspectives: a frequentist view, where the parameter is a deterministic unknown quantity, and a Bayesian approach, where the parameter is drawn from a prior distribution. We show in this paper that methods derived from this second perspective prove optimal when evaluated using the frequentist cumulated regret as a measure of perf...
Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors
In stochastic bandit problems, a Bayesian policy called Thompson sampling (TS) has recently attracted much attention for its excellent empirical performance. However, the theoretical analysis of this policy is difficult and its asymptotic optimality is only proved for one-parameter models. In this paper we discuss the optimality of TS for the model of normal distributions with unknown means and...
MATHEMATICAL ENGINEERING TECHNICAL REPORTS: Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors
In stochastic bandit problems, a Bayesian policy called Thompson sampling (TS) has recently attracted much attention for its excellent empirical performance. However, the theoretical analysis of this policy is difficult and its asymptotic optimality is only proved for one-parameter models. In this paper we discuss the optimality of TS for the model of normal distributions with unknown means and...
Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits
I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Git...
Algorithms for Linear Bandits on Polyhedral Sets
We study a stochastic linear optimization problem with bandit feedback. The set of arms takes values in an N-dimensional space and belongs to a bounded polyhedron described by finitely many linear inequalities. We provide a lower bound for the expected regret that scales as Ω(N log T ). We then provide a nearly optimal algorithm that alternates between exploration and exploitation intervals and sh...
Journal:
- CoRR
Volume: abs/1504.05823
Pages: -
Publication date: 2015